Corpora and data preparation

Authors

Lynn Carlson

Boyan A. Onyshkevych

Mary Ellen Okurowski

Abstract

The data selection and data preparation efforts which led to the TIPSTER and Fifth Message Understanding Conference (MUC-5) evaluation corpora involved substantial effort, time and resources. The Government commitment to these selection and preparation efforts stems from four TIPSTER Program objectives: (1) to provide training data that would promote the development of information extraction technology, (2) to provide accurate test data to evaluate and baseline system performance in an objective manner, (3) to provide a baseline for human performance to understand and interpret machine performance, and (4) to support the larger Natural Language Processing community by making available a unique set of texts and templates in multiple domains and languages under ARPA support. This commitment was demonstrated through the managerial, technical, and administrative support to these efforts from various Government agencies, as well as through the contractual efforts with the Institute for Defense Analyses for data preparation and New Mexico State University for software tool development.

Similar resources

Extracting parallel corpora from comparable documents to improve translation quality in machine translation systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

Corpora and Data Preparation for Information Extraction

The data selection and data preparation efforts which led to the TIPSTER and Fifth Message Understanding Conference (MUC-5) corpora involved substantial effort, time and resources. The Government commitment to these selection and preparation efforts stems from four TIPSTER Program objectives: (1) to provide training data that would promote the development of information extraction technology, (...

The MMSR bilingual and crosschannel corpora for speaker recognition research and evaluation

We describe efforts to create corpora to support and evaluate systems that meet the challenge of speaker recognition in the face of both channel and language variation. In addition to addressing ongoing evaluation of speaker recognition systems, these corpora are aimed at the bilingual and crosschannel dimensions. We report on specific data collection efforts at the Linguistic Data Consortium, ...

Experiments in Medical Translation Shared Task at WMT 2014

This paper describes Dublin City University’s (DCU) submission to the WMT 2014 Medical Summary task. We report our results on the test data set in the French to English translation direction. We also report statistics collected from the corpora used to train our translation system. We conducted our experiment on the Moses 1.0 phrase-based translation system framework. We performed a variety of ...

Design and Preparation of the 1996 Hub-4 Broadcast News Benchmark Test Corpora

This paper describes the procedures used in the preparation of the 1996 DARPA CSR Hub-4 Broadcast News Benchmark Test corpora and some analyses of that data. A new annotation/transcription process was designed and implemented to ensure that the transcripts were practically error-free and to negate the need to hold a post-test "adjudication" as in years past. This paper focuses on this new annot...

Publication date: 1993
